Trends in Spanish-American "Top 50 tracks" Playlists

Spotify, which includes more than 16 million tracks and podcasts, is one of the most used music streaming platforms worldwide. This makes it a very relevant reference for understanding how listening trends change over time. Therefore, in this notebook we make an in-depth analysis of Spanish-American trending tracks as of November 14, 2024. The dataset contains tracks from the "Top-50" playlists of 17 different Spanish-American countries: Colombia, Mexico, Spain, Argentina, Venezuela, Chile, Ecuador, Dominican Republic, Peru, Panama, Uruguay, Paraguay, Bolivia, Costa Rica, Guatemala, Honduras, and El Salvador.

The data was obtained from the Spotify API.

Import libraries

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
import seaborn as sns
import datetime as dt
pio.renderers.default = 'notebook'

Loading Data

In [18]:
pd.set_option('display.max_columns',None)
tracks_df = pd.read_excel('top_tracks_latin_playlists.xlsx',sheet_name='tracks')
tracks_df.head()
Out[18]:
id name explicit release_date artist popularity playlist danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms time_signature
0 2btNsI4OvcVl7SAHQQDHFB Mirame True 2024-04-17 Blessd 87 Top-50-Colombia 0.717 0.656 7 -4.449 1 0.0797 0.141 0.000030 0.0661 0.695 175.956 157453 4
1 6WatFBLVB0x077xWeoVc2k Si Antes Te Hubiera Conocido False 2024-06-21 KAROL G 95 Top-50-Colombia 0.924 0.668 11 -6.795 1 0.0469 0.446 0.000594 0.0678 0.787 128.027 195824 4
2 13BDiikG6y5o5cQTK0HpW6 Soltera - W Sound 01 True 2024-08-06 W Sound 79 Top-50-Colombia 0.734 0.578 1 -4.147 1 0.2950 0.155 0.000242 0.1130 0.880 199.997 142022 4
3 7bywjHOc0wSjGGbj04XbVi LUNA False 2023-12-01 Feid 89 Top-50-Colombia 0.774 0.860 7 -2.888 0 0.1300 0.131 0.000000 0.1160 0.446 100.019 196800 4
4 5QjmUqgpPQgXgg4606DqZF UWAIE False 2024-08-15 Kapo 88 Top-50-Colombia 0.705 0.783 9 -4.783 0 0.0403 0.138 0.000000 0.0984 0.454 103.001 172427 4

Data Information

In [19]:
tracks_df.shape
#(Rows, columns)
Out[19]:
(849, 20)
In [20]:
tracks_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 849 entries, 0 to 848
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                849 non-null    object 
 1   name              849 non-null    object 
 2   explicit          849 non-null    bool   
 3   release_date      849 non-null    object 
 4   artist            849 non-null    object 
 5   popularity        849 non-null    int64  
 6   playlist          849 non-null    object 
 7   danceability      849 non-null    float64
 8   energy            849 non-null    float64
 9   key               849 non-null    int64  
 10  loudness          849 non-null    float64
 11  mode              849 non-null    int64  
 12  speechiness       849 non-null    float64
 13  acousticness      849 non-null    float64
 14  instrumentalness  849 non-null    float64
 15  liveness          849 non-null    float64
 16  valence           849 non-null    float64
 17  tempo             849 non-null    float64
 18  duration_ms       849 non-null    int64  
 19  time_signature    849 non-null    int64  
dtypes: bool(1), float64(9), int64(5), object(5)
memory usage: 127.0+ KB
id: Track's unique identifier.
name: Track's name.
explicit: Whether or not the track has explicit lyrics. True: it has explicit lyrics. False: it does not.
release_date: Track's first release date.
artist: Artist who performs the track.
popularity: The popularity of the track. It is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. 0 - 100, from least popular to most popular.
playlist: Playlist from which the track was extracted.
danceability: How suitable the track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. 0 - 1, from least danceable to most danceable.
energy: Perceptual measure of intensity and activity (fast, loud, noisy). 0 - 1, from least energy to most energy.
key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness: The overall loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.
mode: Track's modality (major or minor). Major is 1 and minor is 0.
speechiness: Presence of spoken words detected in the track. 0 - 1, from least speech-like to most speech-like.
acousticness: Confidence measure of whether the track is acoustic. 0 - 1, from least to most confident.
instrumentalness: Predicts whether the track contains no vocals. 0 - 1; the closer the value is to 1.0, the greater the likelihood that the track contains no vocal content.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence: Describes the musical positiveness conveyed by the track. 0 - 1, from least positive to most positive.
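tempo: Track's overall estimated tempo in beats per minute (BPM).
duration_ms: Track's duration in milliseconds.
time_signature: An estimated time signature (number of beats per bar). The Spotify documentation describes values from 3 to 7, meaning "3/4" to "7/4", so a value of 4 is 4/4. In this dataset the values range from 1 to 5.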
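As a small illustration of the key and mode encodings described above, the following sketch converts the integer codes into readable labels. The names PITCH_CLASSES and key_mode_label are assumptions made for this example; they are not part of the Spotify API or of this notebook.

```python
# Pitch classes in standard Pitch Class notation: 0 = C, 1 = C♯/D♭, 2 = D, ...
PITCH_CLASSES = ['C', 'C♯/D♭', 'D', 'D♯/E♭', 'E', 'F',
                 'F♯/G♭', 'G', 'G♯/A♭', 'A', 'A♯/B♭', 'B']

def key_mode_label(key: int, mode: int) -> str:
    """Hypothetical helper: turn the key/mode integers into text."""
    if key == -1:
        # -1 means that no key was detected.
        return 'Unknown'
    quality = 'major' if mode == 1 else 'minor'
    return f'{PITCH_CLASSES[key]} {quality}'

# First row of the dataset preview: key = 7, mode = 1.
print(key_mode_label(7, 1))  # → G major
```

In the notebook, the same helper could be applied row by row, for example with tracks_df.apply(lambda r: key_mode_label(r['key'], r['mode']), axis=1).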

References:
https://developer.spotify.com/documentation/web-api/reference/get-playlist
https://developer.spotify.com/documentation/web-api/reference/get-audio-features

Exploration

In [21]:
tracks_df.isnull().sum()
Out[21]:
id                  0
name                0
explicit            0
release_date        0
artist              0
popularity          0
playlist            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
dtype: int64
In [22]:
tracks_df.describe()
Out[22]:
popularity danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms time_signature
count 849.000000 849.000000 849.000000 849.000000 849.000000 849.000000 849.000000 849.000000 849.000000 849.000000 849.000000 849.000000 849.000000 849.000000
mean 79.511190 0.740389 0.691458 5.931684 -5.303485 0.415783 0.101737 0.252781 0.003004 0.176362 0.640787 118.520093 186919.494700 3.859835
std 10.023213 0.112127 0.116101 3.179392 1.914944 0.493147 0.084900 0.192217 0.021944 0.153093 0.184413 30.101113 38599.784036 0.432086
min 0.000000 0.373000 0.356000 0.000000 -14.178000 0.000000 0.026600 0.000856 0.000000 0.039000 0.109000 53.376000 68364.000000 1.000000
25% 77.000000 0.686000 0.611000 4.000000 -6.304000 0.000000 0.046900 0.103000 0.000000 0.095700 0.494000 97.541000 158242.000000 4.000000
50% 82.000000 0.753000 0.702000 6.000000 -4.945000 0.000000 0.065400 0.187000 0.000000 0.121000 0.658000 104.977000 182293.000000 4.000000
75% 85.000000 0.820000 0.767000 9.000000 -4.089000 1.000000 0.129000 0.372000 0.000030 0.197000 0.791000 131.842000 207747.000000 4.000000
max 100.000000 0.952000 0.965000 11.000000 0.020000 1.000000 0.625000 0.897000 0.287000 0.949000 0.974000 214.047000 390545.000000 5.000000
In [23]:
tracks_df[tracks_df.select_dtypes(include=['bool']).columns] = tracks_df.select_dtypes(include=['bool']).astype(int)
numeric_tracks_df = tracks_df.select_dtypes(include=['number'])
correlation = numeric_tracks_df.corr(method='kendall').round(2)
mask = np.zeros_like(correlation, dtype = bool)
mask[np.triu_indices_from(mask)] = True
correlation_viz = correlation.mask(mask).dropna(how='all')
correlation_matrix = px.imshow(correlation_viz, text_auto= True, height=600, color_continuous_scale=['#ffffff','#1db954'], aspect='equal', title='<b>Correlation Between Columns')
correlation_matrix.update_layout(title_x=0.5, title_font_size = 24, margin_pad = 5, font_size = 10, font=dict(color='#535353'), yaxis=dict(range=[0,100]))
By far the highest correlation is between "Loudness" and "Energy" (0.4). The next two highest are between "Time Signature" and "Loudness" (0.23), and between "Danceability" and "Explicit" (0.21). These values are Kendall rank correlations, as set in the code above. We will take a closer look at these three correlations later.
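Reading the strongest pairs off a heatmap can be error-prone, so a possible complement is to rank them programmatically. The sketch below is only an illustration: top_correlations is a hypothetical helper, and the toy matrix contains made-up values rather than the notebook's results.

```python
import numpy as np
import pandas as pd

def top_correlations(corr: pd.DataFrame, n: int = 3) -> pd.Series:
    """Hypothetical helper: the n pairs with the largest absolute correlation."""
    # Keep only the strict lower triangle so that each pair appears once
    # and the diagonal (always 1.0) is excluded.
    lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(lower).stack().dropna()
    # Sort by absolute value so strong negative correlations are not missed.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy correlation matrix with made-up values, only to show the call.
toy_corr = pd.DataFrame(
    [[1.0, 0.4, -0.3],
     [0.4, 1.0, 0.1],
     [-0.3, 0.1, 1.0]],
    index=['a', 'b', 'c'], columns=['a', 'b', 'c'])
top_pairs = top_correlations(toy_corr, n=2)
print(top_pairs)
```

With the notebook's data, the call would presumably be top_correlations(correlation), where correlation is the Kendall matrix computed in the cell above.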
In [24]:
tracks_df['name_artist'] = tracks_df['name'] + ' - ' + tracks_df['artist']
top_popular_tracks_fig = px.treemap(tracks_df.drop_duplicates('name_artist').sort_values(by='popularity',ascending=False).head(20),path=[px.Constant('Tracks'),'name_artist'],values='popularity',color='popularity',color_continuous_scale=['#8cfab1','#1db954'])
top_popular_tracks_fig.update_traces(hovertemplate='<b>%{label}</b><br>Popularity: %{value}<extra></extra>',textinfo='label+value',textposition='middle center',textfont_color='#121212',marker=dict(cornerradius=5,line_color='#535353'))
top_popular_tracks_fig.update_layout(title='<b>Top 20 Popular Tracks</b>',title_x=0.5,font_color='#535353',title_font_size=24,hoverlabel_font_color='#535353',margin=dict(b=10))
In general, there is not much variability in the popularity of trending tracks. However, the top four tracks stand slightly apart from the rest, which suggests a noticeable gap in how often they have been played recently.
In [25]:
# Sort by number of trending tracks first and use average popularity to break ties
artist_popularity = tracks_df.groupby(by='artist',as_index=False).agg(average_popularity=('popularity','mean'),artist_tracks=('name','nunique')).sort_values(by=['artist_tracks','average_popularity'],ascending=False).head(20)
popularity_fig = go.Figure(data=go.Bar(x=artist_popularity['artist'],y=artist_popularity['artist_tracks'], name= 'Artist Tracks', text=artist_popularity['artist_tracks'], marker=dict(color='#1db954'), hovertemplate='<b>Artist:</b> %{x}<br><b>Tracks:</b> %{y}<extra></extra>'))
popularity_fig.add_trace(go.Scatter(x=artist_popularity['artist'],y=artist_popularity['average_popularity'], name= 'Artists Average Popularity', text=artist_popularity['average_popularity'].round(1), textposition='top center', mode='lines+markers+text', line=dict(color='#535353'), hovertemplate='<b>Artist:</b> %{x}<br><b>Average Popularity:</b> %{y:.1f}<extra></extra>',yaxis='y2'))
popularity_fig.update_layout(title = "<b>Artist's Trending Tracks vs Average Popularity",title_x = 0.5, title_font_size = 24,font_color='#535353',xaxis=dict(title='<b>Top 20 Artists'),yaxis=dict(title=dict(text="<b># of Trending Tracks"),side="left"),yaxis2=dict(title=dict(text="<b>Average Popularity"),side="right",overlaying="y",tickmode="sync"),legend= dict(orientation='h',xanchor='center',x=0.5,yanchor='top',y=1.1))
Karol G has a high average popularity (88.5) with just four trending tracks, likely reflecting her wide recognition. Engel Montaz, in contrast, has seven trending tracks but a lower average popularity (50.3). This could be because his tracks sit in the lowest positions of the "Top-50" playlists.
In [26]:
explicit_general = tracks_df['explicit'].value_counts(normalize=True) * 100
# Map labels by value (0/1) instead of relying on the frequency order of value_counts
explicit_general = explicit_general.rename({0:'Not Explicit',1:'Explicit'})
trace_general_explicit = go.Pie(labels=explicit_general.index,values=explicit_general.values,marker=dict(colors=['#535353', '#1db954']),showlegend=False,textinfo='label+percent',hovertemplate='<b>%{label}</b><br>%{percent}<extra></extra>')

playlists_grouped = tracks_df.groupby('playlist')['explicit'].value_counts(normalize=True).unstack(fill_value=0).sort_values(by=0,ascending=False) * 100
playlists_grouped.columns = ['Not Explicit','Explicit']
trace_explicit = go.Bar(x=playlists_grouped['Explicit'],y=playlists_grouped.index,name='Explicit',text=playlists_grouped['Explicit'].apply(lambda x: f"{x:.0f}%"),insidetextanchor='middle',hovertemplate='<b>Explicit:</b> %{text}<extra></extra>',marker=dict(color='#1db954'),orientation='h', width=0.9)
trace_not_explicit = go.Bar(x=playlists_grouped['Not Explicit'],y=playlists_grouped.index,name='Not Explicit',text=playlists_grouped['Not Explicit'].apply(lambda x: f"{x:.0f}%"),insidetextanchor='middle',hovertemplate='<b>Not Explicit:</b> %{text}<extra></extra>',marker=dict(color='#535353'),orientation='h', width=0.9)

explicit_figs = make_subplots(rows=1,cols=2,specs=[[{'type': 'domain'}, {'type': 'xy'}]],subplot_titles=['<b>General Explicit vs Not Explicit Tracks','<b>Explicit vs Not Explicit Tracks by Playlist'],horizontal_spacing=0.2)
explicit_figs.add_trace(trace_general_explicit,row=1,col=1)
explicit_figs.add_trace(trace_explicit,row=1,col=2)
explicit_figs.add_trace(trace_not_explicit,row=1,col=2)
explicit_figs.update_layout(font_color='#535353',barmode='stack',xaxis=dict(range=[0,100]),showlegend=False,margin=dict(t=40,b=10))
In general, the trending tracks in Spanish-American countries are mostly non-explicit (59.1%). However, Chile's playlist is mostly explicit (58%), in contrast to Argentina's (18%). Chile is also the only country whose playlist contains more explicit than non-explicit tracks.
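The per-playlist percentages can also be obtained with a shorter aggregation. The following sketch assumes an 'explicit' column encoded as 0/1, as in the notebook after the boolean conversion; the rows and playlist names are made up for illustration.

```python
import pandas as pd

# Toy data with made-up rows; only the column names follow the dataset.
toy_tracks = pd.DataFrame({
    'playlist': ['Top-50-A', 'Top-50-A', 'Top-50-A', 'Top-50-A',
                 'Top-50-B', 'Top-50-B', 'Top-50-B', 'Top-50-B'],
    'explicit': [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, so this is the
# percentage of explicit tracks in each playlist.
explicit_share = toy_tracks.groupby('playlist')['explicit'].mean() * 100

# Playlists in which explicit tracks are the majority.
mostly_explicit = explicit_share[explicit_share > 50].index.tolist()
print(explicit_share.to_dict())
print(mostly_explicit)
```

Applied to tracks_df, the same two lines would list every playlist where explicit tracks exceed 50%.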
In [27]:
analysis_features = ['explicit','danceability','energy','loudness','time_signature']
features_hists = make_subplots(rows=2,cols=3,subplot_titles=analysis_features)
for i, feature in enumerate(analysis_features):
    row = i // 3 + 1
    col = i % 3 + 1
    features_hists.add_trace(go.Histogram(x=tracks_df[feature],name=feature,marker=dict(color='#1db954'),hovertemplate='(%{x}, %{y})<extra></extra>'),row=row,col=col)
features_hists.update_layout(font_color='#535353',title_text='<b>Selected Features Distribution',title_x=0.5,showlegend=False)
features_hists
These are the features involved in the three highest correlations mentioned above. "danceability", "energy", and "loudness" have left-skewed distributions: most trending tracks score high on these features, and very few are quiet, low-energy, or hard to dance to.
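Skewness can also be checked numerically instead of by eye. The sketch below uses pandas' Series.skew on made-up values; a negative result indicates a left-skewed distribution, with a long tail towards low values.

```python
import pandas as pd

# Made-up values between 0 and 1 with a long tail towards low values.
toy_feature = pd.Series([0.20, 0.70, 0.75, 0.80, 0.80, 0.85, 0.90])

# Sample skewness: negative for left-skewed data.
skewness = toy_feature.skew()

# In a left-skewed distribution the mean is pulled below the median.
mean_below_median = toy_feature.mean() < toy_feature.median()
print(skewness < 0, mean_below_median)
```

In the notebook, tracks_df[['danceability','energy','loudness']].skew() would give the corresponding values for the real features.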
In [ ]:
analysis_correlations = [('loudness','energy'),('time_signature','loudness'),
                ('explicit','danceability')]
x,y = analysis_correlations[0]
corr_scatter = px.scatter(tracks_df,x=x,y=y,color=x,trendline='ols',trendline_color_override='#1db954',color_continuous_scale=['#535353','#1db954'],title=f'<b>{x.capitalize()} vs {y.capitalize()}')
corr_scatter.update_layout(font_color='#535353',title_x=0.5)
Louder tracks (higher dB values) tend to have higher energy, that is, they are perceived as more intense and active.
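The trendline='ols' option draws an ordinary least-squares line through the points (Plotly relies on the statsmodels package for this). As a rough stand-in, the sketch below fits the same kind of line with numpy.polyfit on made-up values; a positive slope means that energy increases with loudness.

```python
import numpy as np

# Made-up (loudness in dB, energy) pairs, only to illustrate the fit.
loudness = np.array([-12.0, -9.0, -7.0, -5.0, -4.0, -2.0])
energy = np.array([0.40, 0.52, 0.60, 0.70, 0.72, 0.85])

# Degree-1 least-squares fit: energy ≈ slope * loudness + intercept.
slope, intercept = np.polyfit(loudness, energy, 1)
print(f'energy ≈ {slope:.3f} * loudness + {intercept:.3f}')
```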
In [29]:
x,y = analysis_correlations[1]
corr_scatter = px.box(tracks_df,x=x,y=y,color=x,color_discrete_sequence=['#1db954','#1db954'],title=f'<b>{x.capitalize()} vs {y.capitalize()}')
corr_scatter.update_layout(font_color='#535353',title_x=0.5,showlegend=False)
There is not much to see here, because most of the tracks have an estimated 4/4 time signature (a value of 4 in Spotify's encoding). Even so, a large portion of these tracks are loud (median = -4.783 dB), and their loudness varies widely (max = 0.02, min = -14.178).
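How strongly one time signature dominates can be quantified with value_counts. The sketch below uses made-up codes and labels them following Spotify's documentation, which describes time_signature values of 3 to 7 as "3/4" to "7/4", so a value of 4 is labeled 4/4.

```python
import pandas as pd

# Made-up time_signature codes, only to illustrate the calculation.
toy_signatures = pd.Series([4, 4, 4, 3, 4, 4, 5, 4, 4, 4], name='time_signature')

# Share of tracks per time signature, as percentages.
shares = toy_signatures.value_counts(normalize=True).mul(100)

# Replace the integer codes by readable labels such as '4/4'.
labels = {code: f'{code}/4' for code in shares.index}
shares_labeled = shares.rename(index=labels)
print(shares_labeled)
```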
In [30]:
x,y = analysis_correlations[2]
temp_tracks_df = tracks_df.copy()  # copy so that tracks_df keeps its 0/1 'explicit' column
temp_tracks_df['explicit'] = temp_tracks_df['explicit'].replace({1:'Explicit',0:'Not Explicit'})
corr_scatter = px.box(temp_tracks_df,x=x,y=y,color=x,color_discrete_sequence=['#1db954','#535353'],title=f'<b>{x.capitalize()} vs {y.capitalize()}')
corr_scatter.update_layout(font_color='#535353',title_x=0.5,showlegend=False)
Tracks with explicit lyrics tend to be more danceable (median = 0.786) than non-explicit ones (median = 0.73). Non-explicit tracks also reach lower danceability values (q1 = 0.635, min = 0.373).
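The statistics quoted from the box plot (minimum, first quartile, median) can be computed directly with a grouped aggregation. The sketch below uses made-up values; only the column names follow the dataset.

```python
import pandas as pd

# Toy data with made-up danceability values for each group.
toy = pd.DataFrame({
    'explicit': ['Explicit'] * 4 + ['Not Explicit'] * 4,
    'danceability': [0.70, 0.78, 0.80, 0.90, 0.40, 0.60, 0.70, 0.80],
})

# One row per group with the box plot's main reference values.
summary = toy.groupby('explicit')['danceability'].agg(
    minimum='min',
    q1=lambda s: s.quantile(0.25),
    median='median',
)
print(summary)
```

Note that Plotly may compute box-plot quartiles with a different method than pandas' default linear interpolation, so small differences from the values shown in a chart are possible.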

Conclusions

We can conclude that:
- The top three trending tracks are: "Die with a Smile - Lady Gaga", "Birds of a Feather - Billie Eilish", and "Si Antes te Hubiera Conocido - Karol G".
- Among the artists with the most trending tracks, Karol G has the highest average popularity, with just four tracks, while Engel Montaz has the lowest average, with seven trending tracks.
- Most of the trending tracks in Spanish-American countries have non-explicit lyrics. Chile is the only country where most trending tracks have explicit lyrics.
- The higher the dB level, the higher the track's energy tends to be.
- Most of the trending tracks have a 4/4 time signature.
- Tracks with explicit lyrics tend to be more suitable for dancing than non-explicit ones.